<html>

<!--
   Licensed to the Apache Software Foundation (ASF) under one or more
   contributor license agreements.  See the NOTICE file distributed with
   this work for additional information regarding copyright ownership.
   The ASF licenses this file to You under the Apache License, Version 2.0
   (the "License"); you may not use this file except in compliance with
   the License.  You may obtain a copy of the License at

       https://www.apache.org/licenses/LICENSE-2.0

   Unless required by applicable law or agreed to in writing, software
   distributed under the License is distributed on an "AS IS" BASIS,
   WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
   See the License for the specific language governing permissions and
   limitations under the License.
-->

<body>
Run <a href="https://hadoop.apache.org/">Hadoop</a> MapReduce jobs over
Avro data, with map and reduce functions written in Java.

<p>Avro data files do not contain key/value pairs as expected by
  Hadoop's MapReduce API, but rather just a sequence of values.  This
  package therefore provides a layer on top of Hadoop's MapReduce API.</p>

<p>In all cases, input and output paths are set and jobs are submitted
  as with standard Hadoop jobs:
 <ul>
   <li>Specify input files with {@link
   org.apache.hadoop.mapred.FileInputFormat#setInputPaths}.</li>
   <li>Specify an output directory with {@link
   org.apache.hadoop.mapred.FileOutputFormat#setOutputPath}.</li>
   <li>Run your job with {@link org.apache.hadoop.mapred.JobClient#runJob}.</li>
 </ul>
</p>
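
<p>A minimal sketch of the submission steps listed above.  The class name
  <code>SubmitExample</code> and the use of command-line arguments for the
  two paths are illustrative assumptions; the Avro-specific configuration
  described in the rest of this page would go where the comment indicates.</p>

<pre>
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;

// Hypothetical driver class, for illustration only.
public class SubmitExample {
  public static void main(String[] args) throws Exception {
    JobConf job = new JobConf(SubmitExample.class);
    job.setJobName("avro-example");

    // Input files and output directory, set as for any Hadoop job.
    FileInputFormat.setInputPaths(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));

    // Schemas, mapper and reducer are configured here.

    // Submit the job and wait for it to complete.
    JobClient.runJob(job);
  }
}
</pre>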

<p>For jobs whose input and output are Avro data files:
 <ul>
   <li>Call {@link org.apache.avro.mapred.AvroJob#setInputSchema} and
   {@link org.apache.avro.mapred.AvroJob#setOutputSchema} with your
   job's input and output schemas.</li>
   <li>Subclass {@link org.apache.avro.mapred.AvroMapper} and specify
   this as your job's mapper with {@link
   org.apache.avro.mapred.AvroJob#setMapperClass}.</li>
   <li>Subclass {@link org.apache.avro.mapred.AvroReducer} and specify
   this as your job's reducer and perhaps combiner, with {@link
   org.apache.avro.mapred.AvroJob#setReducerClass} and {@link
   org.apache.avro.mapred.AvroJob#setCombinerClass}.</li>
 </ul>
</p>
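
<p>The steps above might be combined as in the following word-count sketch.
  The class names <code>AvroWordCount</code>, <code>TokenMapper</code> and
  <code>SumReducer</code> are hypothetical, and the input is assumed to be an
  Avro data file of strings; the output is a file of (word, count) pairs.</p>

<pre>
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.avro.Schema;
import org.apache.avro.mapred.AvroCollector;
import org.apache.avro.mapred.AvroJob;
import org.apache.avro.mapred.AvroMapper;
import org.apache.avro.mapred.AvroReducer;
import org.apache.avro.mapred.Pair;
import org.apache.avro.util.Utf8;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.Reporter;

// Hypothetical example class; assumes the input file holds Avro strings.
public class AvroWordCount {

  // Splits each input string into words and emits a (word, 1) pair per word.
  public static class TokenMapper extends AvroMapper&lt;Utf8, Pair&lt;Utf8, Long&gt;&gt; {
    public void map(Utf8 line, AvroCollector&lt;Pair&lt;Utf8, Long&gt;&gt; collector,
        Reporter reporter) throws IOException {
      StringTokenizer tokens = new StringTokenizer(line.toString());
      while (tokens.hasMoreTokens()) {
        collector.collect(new Pair&lt;Utf8, Long&gt;(new Utf8(tokens.nextToken()), 1L));
      }
    }
  }

  // Sums the counts seen for one word; also used as the combiner below.
  public static class SumReducer extends AvroReducer&lt;Utf8, Long, Pair&lt;Utf8, Long&gt;&gt; {
    public void reduce(Utf8 word, Iterable&lt;Long&gt; counts,
        AvroCollector&lt;Pair&lt;Utf8, Long&gt;&gt; collector, Reporter reporter)
        throws IOException {
      long sum = 0;
      for (long count : counts) {
        sum += count;
      }
      collector.collect(new Pair&lt;Utf8, Long&gt;(word, sum));
    }
  }

  public static void main(String[] args) throws Exception {
    JobConf job = new JobConf(AvroWordCount.class);

    Schema string = Schema.create(Schema.Type.STRING);
    Schema wordCount = Pair.getPairSchema(string, Schema.create(Schema.Type.LONG));

    AvroJob.setInputSchema(job, string);
    AvroJob.setOutputSchema(job, wordCount);
    AvroJob.setMapperClass(job, TokenMapper.class);
    AvroJob.setCombinerClass(job, SumReducer.class);
    AvroJob.setReducerClass(job, SumReducer.class);

    FileInputFormat.setInputPaths(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    JobClient.runJob(job);
  }
}
</pre>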

<p>For jobs whose input is an Avro data file and which use an {@link
  org.apache.avro.mapred.AvroMapper}, but whose reducer is a non-Avro
  {@link org.apache.hadoop.mapred.Reducer} and whose output is a
  non-Avro format:
 <ul>
   <li>Call {@link org.apache.avro.mapred.AvroJob#setInputSchema} with your
   job's input schema.</li>
   <li>Subclass {@link org.apache.avro.mapred.AvroMapper} and specify
   this as your job's mapper with {@link
   org.apache.avro.mapred.AvroJob#setMapperClass}.</li>
   <li>Implement {@link org.apache.hadoop.mapred.Reducer} and specify
   your job's reducer with {@link
   org.apache.hadoop.mapred.JobConf#setReducerClass}.  The input key
   and value types should be {@link org.apache.avro.mapred.AvroKey} and {@link
   org.apache.avro.mapred.AvroValue}.</li>
   <li>Optionally implement {@link org.apache.hadoop.mapred.Reducer} and
   specify your job's combiner with {@link
   org.apache.hadoop.mapred.JobConf#setCombinerClass}.  The reducer class
   cannot be reused as the combiner, because the combiner's input and output
   keys must both be {@link org.apache.avro.mapred.AvroKey} and its input and
   output values must both be {@link org.apache.avro.mapred.AvroValue}.</li>
   <li>Specify your job's output key and value types with {@link
   org.apache.hadoop.mapred.JobConf#setOutputKeyClass} and {@link
   org.apache.hadoop.mapred.JobConf#setOutputValueClass}.</li>
   <li>Specify your job's output format with {@link
   org.apache.hadoop.mapred.JobConf#setOutputFormat}.</li>
 </ul>
</p>
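
<p>A sketch of this configuration, assuming Avro string input and plain-text
  output.  The class names are hypothetical.  The call to
  <code>AvroJob.setMapOutputSchema</code> is not among the steps above; it is
  included on the assumption that, with no Avro output schema configured, the
  schema of the mapper's output pairs has to be stated explicitly.</p>

<pre>
import java.io.IOException;
import java.util.Iterator;
import java.util.StringTokenizer;

import org.apache.avro.Schema;
import org.apache.avro.mapred.AvroCollector;
import org.apache.avro.mapred.AvroJob;
import org.apache.avro.mapred.AvroKey;
import org.apache.avro.mapred.AvroMapper;
import org.apache.avro.mapred.AvroValue;
import org.apache.avro.mapred.Pair;
import org.apache.avro.util.Utf8;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;
import org.apache.hadoop.mapred.TextOutputFormat;

// Hypothetical example class: Avro mapper, non-Avro reducer, text output.
public class AvroToTextWordCount {

  // Avro mapper: emits a (word, 1) pair for each word of each input string.
  public static class TokenMapper extends AvroMapper&lt;Utf8, Pair&lt;Utf8, Long&gt;&gt; {
    public void map(Utf8 line, AvroCollector&lt;Pair&lt;Utf8, Long&gt;&gt; collector,
        Reporter reporter) throws IOException {
      StringTokenizer tokens = new StringTokenizer(line.toString());
      while (tokens.hasMoreTokens()) {
        collector.collect(new Pair&lt;Utf8, Long&gt;(new Utf8(tokens.nextToken()), 1L));
      }
    }
  }

  // Non-Avro reducer: receives AvroKey/AvroValue, emits Writable types.
  public static class TextSumReducer extends MapReduceBase
      implements Reducer&lt;AvroKey&lt;Utf8&gt;, AvroValue&lt;Long&gt;, Text, LongWritable&gt; {
    public void reduce(AvroKey&lt;Utf8&gt; word, Iterator&lt;AvroValue&lt;Long&gt;&gt; counts,
        OutputCollector&lt;Text, LongWritable&gt; out, Reporter reporter)
        throws IOException {
      long sum = 0;
      while (counts.hasNext()) {
        sum += counts.next().datum();
      }
      out.collect(new Text(word.datum().toString()), new LongWritable(sum));
    }
  }

  public static void main(String[] args) throws Exception {
    JobConf job = new JobConf(AvroToTextWordCount.class);

    Schema string = Schema.create(Schema.Type.STRING);
    AvroJob.setInputSchema(job, string);
    AvroJob.setMapperClass(job, TokenMapper.class);
    // Assumption: declare the schema of the pairs the mapper emits.
    AvroJob.setMapOutputSchema(job,
        Pair.getPairSchema(string, Schema.create(Schema.Type.LONG)));

    job.setReducerClass(TextSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(LongWritable.class);
    job.setOutputFormat(TextOutputFormat.class);

    FileInputFormat.setInputPaths(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    JobClient.runJob(job);
  }
}
</pre>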

<p>For jobs whose input is a non-Avro data file and which use a
  non-Avro {@link org.apache.hadoop.mapred.Mapper}, but whose reducer
  is an {@link org.apache.avro.mapred.AvroReducer} and whose output is
  an Avro data file:
 <ul>
   <li>Set your input file format with {@link
   org.apache.hadoop.mapred.JobConf#setInputFormat}.</li>
   <li>Implement {@link org.apache.hadoop.mapred.Mapper} and specify
   your job's mapper with {@link
   org.apache.hadoop.mapred.JobConf#setMapperClass}.  The output key
   and value types should be {@link org.apache.avro.mapred.AvroKey} and
   {@link org.apache.avro.mapred.AvroValue}.</li>
   <li>Subclass {@link org.apache.avro.mapred.AvroReducer} and specify
   this as your job's reducer and perhaps combiner, with {@link
   org.apache.avro.mapred.AvroJob#setReducerClass} and {@link
   org.apache.avro.mapred.AvroJob#setCombinerClass}.</li>
   <li>Call {@link org.apache.avro.mapred.AvroJob#setOutputSchema} with your
   job's output schema.</li>
 </ul>
</p>
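
<p>A sketch of this configuration, assuming plain-text input read with
  <code>TextInputFormat</code> and an Avro output file of (word, count)
  pairs.  The class names are hypothetical.</p>

<pre>
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.avro.Schema;
import org.apache.avro.mapred.AvroCollector;
import org.apache.avro.mapred.AvroJob;
import org.apache.avro.mapred.AvroKey;
import org.apache.avro.mapred.AvroReducer;
import org.apache.avro.mapred.AvroValue;
import org.apache.avro.mapred.Pair;
import org.apache.avro.util.Utf8;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;
import org.apache.hadoop.mapred.TextInputFormat;

// Hypothetical example class: non-Avro mapper, Avro reducer, Avro output.
public class TextToAvroWordCount {

  // Non-Avro mapper: reads text lines, emits AvroKey/AvroValue wrappers.
  public static class TextTokenMapper extends MapReduceBase
      implements Mapper&lt;LongWritable, Text, AvroKey&lt;Utf8&gt;, AvroValue&lt;Long&gt;&gt; {
    public void map(LongWritable offset, Text line,
        OutputCollector&lt;AvroKey&lt;Utf8&gt;, AvroValue&lt;Long&gt;&gt; out, Reporter reporter)
        throws IOException {
      StringTokenizer tokens = new StringTokenizer(line.toString());
      while (tokens.hasMoreTokens()) {
        out.collect(new AvroKey&lt;Utf8&gt;(new Utf8(tokens.nextToken())),
            new AvroValue&lt;Long&gt;(1L));
      }
    }
  }

  // Avro reducer: sums the counts for one word; also used as the combiner.
  public static class SumReducer extends AvroReducer&lt;Utf8, Long, Pair&lt;Utf8, Long&gt;&gt; {
    public void reduce(Utf8 word, Iterable&lt;Long&gt; counts,
        AvroCollector&lt;Pair&lt;Utf8, Long&gt;&gt; collector, Reporter reporter)
        throws IOException {
      long sum = 0;
      for (long count : counts) {
        sum += count;
      }
      collector.collect(new Pair&lt;Utf8, Long&gt;(word, sum));
    }
  }

  public static void main(String[] args) throws Exception {
    JobConf job = new JobConf(TextToAvroWordCount.class);

    job.setInputFormat(TextInputFormat.class);
    job.setMapperClass(TextTokenMapper.class);

    AvroJob.setCombinerClass(job, SumReducer.class);
    AvroJob.setReducerClass(job, SumReducer.class);
    AvroJob.setOutputSchema(job, Pair.getPairSchema(
        Schema.create(Schema.Type.STRING), Schema.create(Schema.Type.LONG)));

    FileInputFormat.setInputPaths(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    JobClient.runJob(job);
  }
}
</pre>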

<p>For jobs whose input is a non-Avro data file and which use a
  non-Avro {@link org.apache.hadoop.mapred.Mapper} and no reducer,
  i.e., a <i>map-only</i> job:
 <ul>
   <li>Set your input file format with {@link
   org.apache.hadoop.mapred.JobConf#setInputFormat}.</li>
   <li>Implement {@link org.apache.hadoop.mapred.Mapper} and specify
   your job's mapper with {@link
   org.apache.hadoop.mapred.JobConf#setMapperClass}.  The output key
   and value types should be {@link org.apache.avro.mapred.AvroWrapper} and
   {@link org.apache.hadoop.io.NullWritable}.</li>
   <li>Call {@link
   org.apache.hadoop.mapred.JobConf#setNumReduceTasks(int)} with zero.</li>
   <li>Call {@link org.apache.avro.mapred.AvroJob#setOutputSchema} with your
   job's output schema.</li>
 </ul>
</p>
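
<p>A sketch of such a map-only job, assuming plain-text input that is
  copied line by line into an Avro data file of strings.  The class names
  <code>TextToAvro</code> and <code>LineMapper</code> are hypothetical.</p>

<pre>
import java.io.IOException;

import org.apache.avro.Schema;
import org.apache.avro.mapred.AvroJob;
import org.apache.avro.mapred.AvroWrapper;
import org.apache.avro.util.Utf8;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;
import org.apache.hadoop.mapred.TextInputFormat;

// Hypothetical example class: map-only conversion of text lines to Avro.
public class TextToAvro {

  // Wraps each text line in an AvroWrapper; the value is always NullWritable.
  public static class LineMapper extends MapReduceBase
      implements Mapper&lt;LongWritable, Text, AvroWrapper&lt;Utf8&gt;, NullWritable&gt; {
    public void map(LongWritable offset, Text line,
        OutputCollector&lt;AvroWrapper&lt;Utf8&gt;, NullWritable&gt; out, Reporter reporter)
        throws IOException {
      out.collect(new AvroWrapper&lt;Utf8&gt;(new Utf8(line.toString())),
          NullWritable.get());
    }
  }

  public static void main(String[] args) throws Exception {
    JobConf job = new JobConf(TextToAvro.class);

    job.setInputFormat(TextInputFormat.class);
    job.setMapperClass(LineMapper.class);
    job.setNumReduceTasks(0);  // map-only: mapper output is written directly

    AvroJob.setOutputSchema(job, Schema.create(Schema.Type.STRING));

    FileInputFormat.setInputPaths(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    JobClient.runJob(job);
  }
}
</pre>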

</body>
</html>
